8

a

library(ISLR2)
lm.fit <- lm(mpg ~ horsepower, data=Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Is there a relationship between the predictor and the response?

Yes. About 60% of the variance in mpg is explained by horsepower, and the F-statistic (599.7, p < 2.2e-16) means we reject the null hypothesis that there is no relationship between the two.

How strong is the relationship between the predictor and the response?

“Strong” is best judged by \(R^2\) and by the RSE relative to the mean response, not by the raw slope, which depends on the units of measurement. With \(R^2 \approx 0.61\) and an RSE of 4.91 (roughly 21% of the mean mpg), the relationship is moderately strong.
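One way to put a number on “strength” is to compare the RSE with the mean response; a small sketch, re-fitting the model from above:

```r
library(ISLR2)

# Re-fit the simple regression and express the RSE as a fraction of the
# mean response; together with R^2 this quantifies how "strong" the fit is.
lm.fit <- lm(mpg ~ horsepower, data = Auto)
s <- summary(lm.fit)
rse <- s$sigma                    # residual standard error (~4.91)
rel_err <- rse / mean(Auto$mpg)   # relative error, roughly 21%
c(RSE = rse, relative_error = rel_err, R2 = s$r.squared)
```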

Is the relationship between the predictor and the response positive or negative?

Negative: the coefficient on horsepower is -0.158, so an increase in horsepower is associated with lower mpg.

What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

predict(lm.fit, data.frame(horsepower=c(98)), interval = 'confidence')
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame(horsepower=c(98)), interval = 'prediction')
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

As expected, the prediction interval is larger.

b

library(ggplot2)
ggplot(Auto, aes(x=horsepower, y=mpg)) + geom_point(size=2, shape=23) + geom_abline(intercept = lm.fit$coefficients[1], slope = lm.fit$coefficients[2], color="red")

c

plot(lm.fit)

Potential issues:

- the Residuals vs Fitted plot shows a clear curved pattern rather than a uniform band, which suggests the true relationship is non-linear
- the Residuals vs Leverage plot identifies a few points that would noticeably shift the regression line if they were removed

lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data=Auto)
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
plot(lm.fit2)

9

a

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(subset(Auto, select=-c(name)), aes(colour=as.factor(origin)))
## Warning in cor(x, y): the standard deviation is zero

b

cor(subset(Auto, select=-c(name)))
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

c

lm.fit <- lm(mpg ~ ., data=subset(Auto, select=-c(name)))
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ ., data = subset(Auto, select = -c(name)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

There’s a clear relationship between the predictors and the response: the model explains about 82% of the variance (\(R^2 = 0.82\)). Some predictors, such as cylinders, horsepower, and acceleration, are not significant, while year is highly statistically significant given its p-value.

d

plot(lm.fit)

The Residuals vs Leverage plot indicates a few observations with very high leverage.

The Residuals vs Fitted plot isn’t uniform, indicating the true relationship is unlikely to be linear.
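The flagged points can also be inspected numerically; a hedged sketch using studentized residuals and hat values on the same fit:

```r
library(ISLR2)

# Studentized residuals beyond |3| suggest outliers; hat values flag
# high-leverage observations (twice the average leverage is a common cutoff).
fit <- lm(mpg ~ ., data = subset(Auto, select = -c(name)))
which(abs(rstudent(fit)) > 3)   # candidate outliers
h <- hatvalues(fit)
which(h > 2 * mean(h))          # high-leverage observations
```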

e

lm.fit_multi <- lm(mpg ~ .^2, data=subset(Auto, select=-c(name)))
summary(lm.fit_multi)
## 
## Call:
## lm(formula = mpg ~ .^2, data = subset(Auto, select = -c(name)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Note the difference in R between * and : in formulas: a*b expands to the main effects plus the interaction (a + b + a:b), whereas a:b includes only the interaction term.
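The equivalence is easy to verify on toy data (a minimal sketch; the variable names are made up):

```r
# a*b expands to main effects plus interaction, so the two fits below
# produce identical coefficients.
set.seed(42)
d <- data.frame(w = rnorm(50), yr = rnorm(50))
d$m <- 1 + 2 * d$w - d$yr + 0.5 * d$w * d$yr + rnorm(50, sd = 0.1)

fit_star  <- lm(m ~ w * yr, data = d)         # main effects + interaction
fit_colon <- lm(m ~ w + yr + w:yr, data = d)  # explicit equivalent
all.equal(coef(fit_star), coef(fit_colon))    # TRUE
```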

Among the interaction terms, displacement:year, acceleration:year, and acceleration:origin are statistically significant at the 5% level.

f

lm.fit <- lm(mpg ~ weight + acceleration + displacement, data=subset(Auto, select=-c(name)))
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ weight + acceleration + displacement, data = subset(Auto, 
##     select = -c(name)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6583  -2.7805  -0.3571   2.4971  16.2067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.003203   1.864930  21.986  < 2e-16 ***
## weight       -0.006174   0.000742  -8.320 1.51e-15 ***
## acceleration  0.186058   0.097970   1.899   0.0583 .  
## displacement -0.010631   0.006524  -1.630   0.1040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.279 on 388 degrees of freedom
## Multiple R-squared:  0.7017, Adjusted R-squared:  0.6994 
## F-statistic: 304.3 on 3 and 388 DF,  p-value: < 2.2e-16
lm.fit1 <- lm(mpg ~ weight + acceleration + I(log(displacement)), data=subset(Auto, select=-c(name)))
summary(lm.fit1)
## 
## Call:
## lm(formula = mpg ~ weight + acceleration + I(log(displacement)), 
##     data = subset(Auto, select = -c(name)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5479  -2.6642  -0.3638   2.3460  16.8464 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          61.4565452  4.9950040  12.304  < 2e-16 ***
## weight               -0.0043803  0.0007195  -6.088 2.75e-09 ***
## acceleration          0.1302337  0.0896918   1.452    0.147    
## I(log(displacement)) -5.2637315  1.2019343  -4.379 1.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.191 on 388 degrees of freedom
## Multiple R-squared:  0.7138, Adjusted R-squared:  0.7116 
## F-statistic: 322.6 on 3 and 388 DF,  p-value: < 2.2e-16

Taking the log of displacement increases \(R^2\) (0.714 vs 0.702), and acceleration appears fairly insignificant in both fits.
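The two models are not nested (displacement vs its log), so anova() is not appropriate; AIC gives a like-for-like comparison. A sketch, re-fitting both models:

```r
library(ISLR2)

# Both models have the same number of parameters, so the lower AIC simply
# reflects the better fit of the log transform.
fit_lin <- lm(mpg ~ weight + acceleration + displacement, data = Auto)
fit_log <- lm(mpg ~ weight + acceleration + I(log(displacement)), data = Auto)
AIC(fit_lin, fit_log)   # fit_log has the lower AIC
```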

10

a

library(ISLR2)
lm.fit <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b

Urban and US are both qualitative.

Price is negatively correlated with Sales, indicating that higher prices are associated with lower sales.

US is positive, indicating higher sales in US-based stores.

c

\(f(X) = 13.04 - 0.054X_1 - 0.022X_2 + 1.20X_3\), where \(X_1\) is Price, \(X_2 = 1\) if the store is urban (0 otherwise), and \(X_3 = 1\) if the store is in the US (0 otherwise).
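The equation can be checked against the fitted coefficients and used for a hand prediction (a sketch; the Price = 100 example is made up):

```r
library(ISLR2)

# Coefficients of the model from (a); the dummy coding makes UrbanYes and
# USYes contribute only when the corresponding level is "Yes".
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
round(coef(fit), 3)

# Hand prediction for an urban US store with Price = 100:
sum(coef(fit) * c(1, 100, 1, 1))
```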

d

We can reject the null hypothesis for all predictors except Urban, which is not statistically significant.

summary(lm(Sales ~ Urban, data=Carseats))
## 
## Call:
## lm(formula = Sales ~ Urban, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5636 -2.1107 -0.0109  1.7914  8.8018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.56356    0.26028  29.060   <2e-16 ***
## UrbanYes    -0.09537    0.30998  -0.308    0.759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.827 on 398 degrees of freedom
## Multiple R-squared:  0.0002378,  Adjusted R-squared:  -0.002274 
## F-statistic: 0.09465 on 1 and 398 DF,  p-value: 0.7585

e

lm.carseats2 <- lm(Sales ~ Price + US, data=Carseats)
summary(lm.carseats2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

The RSE is slightly smaller than with Urban included, and the adjusted \(R^2\) is slightly higher.
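Since the smaller model is nested in the one from (a), an F-test can confirm that dropping Urban loses nothing; a sketch:

```r
library(ISLR2)

# anova() compares the nested fits; a large p-value means Urban adds no
# explanatory power beyond Price and US.
full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US, data = Carseats)
anova(reduced, full)
```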

f

Rather poorly: the model explains only about 24% of the variance in Sales.

g

confint(lm.carseats2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

h

plot(lm.carseats2)

There is some evidence of high leverage - but the Residuals vs Fitted plot is relatively uniform.
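The leverage evidence can be quantified with hat values; a sketch on the same fit, using the common rule of thumb of twice the average leverage (p+1)/n:

```r
library(ISLR2)

# Hat values measure leverage; their average is always (p+1)/n, so values
# well above twice the average are worth a closer look.
fit <- lm(Sales ~ Price + US, data = Carseats)
h <- hatvalues(fit)
avg <- length(coef(fit)) / nrow(Carseats)   # (p+1)/n
which.max(h)                                # highest-leverage observation
sum(h > 2 * avg)                            # count above twice the average
```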

11

a

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
lm.fit <- lm(y ~ 0 + x)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ 0 + x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

b

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
lm.fit <- lm(x ~ 0 + y)
summary(lm.fit)
## 
## Call:
## lm(formula = x ~ 0 + y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

c

The \(R^2\) is identical to the regression the other way around. The residuals are smaller and the RSE is very different, but RSE is scale-dependent (it is measured in the units of the response), so the difference does not mean one fit is better than the other.
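The identical \(R^2\) is no accident: for regression through the origin, the product of the two slope estimates equals \(R^2\). A quick check on the same simulated data:

```r
# beta_yx = sum(xy)/sum(x^2) and beta_xy = sum(xy)/sum(y^2); their product
# is sum(xy)^2 / (sum(x^2) * sum(y^2)), which is exactly R^2 for the
# no-intercept fit.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
b_yx <- unname(coef(lm(y ~ 0 + x)))   # ~1.994
b_xy <- unname(coef(lm(x ~ 0 + y)))   # ~0.391
c(product = b_yx * b_xy, R2 = summary(lm(y ~ 0 + x))$r.squared)
```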

d

n <- length(x)
top <- sqrt(n-1) * sum(x*y)
bottom <- sqrt(sum(x^2)*sum(y^2)-sum(x*y)^2)
top/bottom
## [1] 18.72593
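As a sanity check, the formula’s value should match the t value reported by summary() for the no-intercept fit:

```r
# Recompute the algebraic t-statistic and compare it with lm's own.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)
t_formula <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
t_lm <- summary(lm(y ~ 0 + x))$coefficients[1, "t value"]
c(t_formula, t_lm)   # both ~18.726
```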

e

The formula in (d) is symmetric in x and y, so swapping them yields the same t-statistic, which is consistent with the identical t values (18.73) in (a) and (b).

f
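For (f), the claim can be checked directly: with an intercept included, the slope t-statistics for y on x and x on y are also identical (both equal \(r\sqrt{(n-2)/(1-r^2)}\), which is symmetric in x and y). A sketch:

```r
# With an intercept, the slope t value depends only on the correlation r
# and the sample size n, both symmetric in x and y, so the two fits agree.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
t_yx <- summary(lm(y ~ x))$coefficients["x", "t value"]
t_xy <- summary(lm(x ~ y))$coefficients["y", "t value"]
c(t_yx, t_xy)   # identical
```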

12

a